========================================================

Data analysis and exploration

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

Introduction

In this Explore and Summarize Data project I explore the red wine data set. The main objective is to find out which chemical variables have an impact on the quality of the wine. The data set contains 12 variables and 1599 observations.

Univariate Plots Section

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

From the summary I see that the red wines have:

- Chlorides: between 0.012 and 0.611, with a mean of 0.087
- pH: between 2.740 and 4.010, with a mean of 3.311
- Alcohol: between 8.40 and 14.90, with a mean of 10.42
- Quality: between 3.000 and 8.000, with a mean of 5.636

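The quality of red wine is roughly normally distributed around 5 to 6 (mean 5.636, median 6), so most wines in the collection sit in the middle of the observed 3 to 8 range, which I read as reasonably good quality.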

Looking at all the chemical variables that may have an impact on the wine, I found that residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide and sulphates are positively skewed, while density and pH are normally distributed.

I grouped free and total sulfur dioxide together. From the histograms above I see that both free.sulfur.dioxide and total.sulfur.dioxide look approximately normal after the log10 transformation described in the Univariate Analysis.


I grouped the acids together. From the histograms above I see that:

- Fixed acidity is mostly between 5 and 11.
- Volatile acidity is mostly between 0.2 and 0.8.
- Citric acid is mostly between 0.01 and 0.50.

From the chart above it seems that the alcohol content is not normally distributed: it is right-skewed, with a high peak at the lower end of the range.

Based on the histogram above, I defined a new variable called alcohol density (alcohol.density), derived from alcohol, to see which alcohol range holds the most wines:

- Low Alcohol: alcohol between 8.3 and 10.5
- Medium Alcohol: alcohol between 10.55 and 12.5
- High Alcohol: alcohol between 12.6 and 14.9

Low Alcohol has the highest count, around 1000 wines, followed by Medium Alcohol with around 550 and High Alcohol with around 60.
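A minimal sketch of how such a binned variable could be built with base R's `cut()`. The ten alcohol values are copied from the `str()` output in this report. The level names after "Low Alcohol" and the choice of left-closed intervals are assumptions: `str()` codes an alcohol of 10.5 as the second level, so the sketch places 10.5 in Medium Alcohol.

```r
# First ten alcohol values, copied from the str() output in this report.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Bin edges follow the ranges given in the text. right = FALSE makes the
# intervals left-closed, so 10.5 lands in the second level (an assumption
# chosen to match the codes shown by str()).
alcohol_density <- cut(
  alcohol,
  breaks = c(8.3, 10.5, 12.5, 14.9),
  labels = c("Low Alcohol", "Medium Alcohol", "High Alcohol"),
  right = FALSE,
  include.lowest = TRUE,   # keeps the maximum, 14.9, inside the last bin
  ordered_result = TRUE    # Low < Medium < High
)

# Count of wines per level for this small sample.
table(alcohol_density)
```

On the full data set the same call would be applied to the whole alcohol column rather than to ten values.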

Based on the histogram above, I defined a new variable called alcohol quality (alcohol.quality), derived from quality: low is quality < 5, Medium is quality < 7 and v.good is quality >= 7. I wanted to see which level is the most common, and found that Medium has the highest count.
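The quality levels can be sketched the same way. The ten quality scores are copied from the `str()` output in this report, which codes a quality of 7 as the third level, so the sketch treats v.good as 7 and above. The exact break points are an assumption chosen to reproduce those codes.

```r
# First ten quality scores, copied from the str() output in this report.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Left-closed bins: low is below 5, Medium is 5 or 6, v.good is 7 and above.
alcohol_quality <- cut(
  quality,
  breaks = c(-Inf, 5, 7, Inf),
  labels = c("low", "Medium", "v.good"),
  right = FALSE,
  ordered_result = TRUE
)

# Count of wines per level for this small sample.
table(alcohol_quality)
```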

Univariate Analysis

What is the structure of your dataset?

## 'data.frame':    1599 obs. of  14 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ alcohol.density     : Ord.factor w/ 3 levels "Low Alcohol"<..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ alcohol.quality     : Ord.factor w/ 3 levels "low"<"Medium"<..: 2 2 2 2 2 2 2 3 3 2 ...

The wine dataset contains 1599 observations and 12 original variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality).

I also created 2 variables, alcohol density and alcohol quality, which is why the structure above lists 14 variables.

What is/are the main feature(s) of interest in your dataset?

Wine quality and alcohol are the main features of interest.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Sulphates and density.

Did you create any new variables from existing variables in the dataset?

Yes: alcohol density (alcohol.density) and alcohol quality (alcohol.quality).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes. I observed some unusual distributions in the fixed acidity, citric acid, volatile acidity, free sulfur dioxide and total sulfur dioxide variables, so I used a log10 transformation to understand the distributions better.
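To illustrate why a log10 scale helps with a long right tail, here is a small sketch using only the five-number summary of total.sulfur.dioxide printed earlier in this report. The `tail_ratio` helper is a hypothetical name introduced for this illustration. It compares the length of the upper tail with the length of the lower tail, where a value near 1 means the two tails are about equally long.

```r
# Five-number summary of total.sulfur.dioxide, copied from the summary() output.
tsd <- c(min = 6, q1 = 22, median = 38, q3 = 62, max = 289)

# Hypothetical helper: upper-tail length divided by lower-tail length.
# A value close to 1 indicates a roughly symmetric spread around the median.
tail_ratio <- function(q) {
  unname((q["max"] - q["median"]) / (q["median"] - q["min"]))
}

raw_ratio <- tail_ratio(tsd)         # on the original scale
log_ratio <- tail_ratio(log10(tsd))  # after the log10 transformation

c(raw = raw_ratio, log10 = log_ratio)
```

The upper tail is several times longer than the lower tail on the raw scale and about the same length after the transformation, which is why the log10 histogram looks closer to a normal distribution.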

Bivariate Plots Section

From the chart above I see that density increases with fixed acidity.

From the chart above I see that density decreases as alcohol increases.

From the chart above I see that the relationship is negative: lower pH correlates with higher fixed acidity.

From the chart above I see that density increases with residual sugar.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

From the chart above I see that there is a negative correlation between citric acid and volatile acidity.

From the chart above I see that the wine in the High Alcohol level has the lowest median density.

From the chart above I see that the wine in the v.good quality level has the lowest median volatile acidity.
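The medians that a box plot draws can also be computed directly. The sketch below uses the first ten rows from the `str()` output in this report, so it only illustrates the calculation and is not the full result. The break points used to rebuild the quality levels are an assumption chosen to match the codes shown by `str()`.

```r
# First ten rows, copied from the str() output in this report.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Rebuild the quality levels (assumed breaks: below 5, 5 to 6, 7 and above).
alcohol_quality <- cut(
  quality,
  breaks = c(-Inf, 5, 7, Inf),
  labels = c("low", "Medium", "v.good"),
  right = FALSE,
  ordered_result = TRUE
)

# Median volatile acidity per quality level. Levels with no wines in this
# small sample (here "low") come back as NA.
median_va <- tapply(volatile_acidity, alcohol_quality, median)
median_va
```

Even in this small sample, the v.good wines have a lower median volatile acidity than the Medium wines.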

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I see a positive correlation between fixed acidity and density, and between residual sugar and density. I see a negative correlation between density and alcohol, and between fixed acidity and pH.

From the box plots, I see a negative relationship between alcohol quality and volatile acidity, and a positive relationship between alcohol quality and alcohol.
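The direction of such relationships can be checked numerically with `cor()`. The sketch below uses only the first ten rows from the `str()` output in this report, so the coefficients are illustrative and will differ from those of the full 1599 observations. It covers fixed acidity against pH, and citric acid against volatile acidity from the Bivariate Plots Section.

```r
# First ten rows, copied from the str() output in this report.
fixed_acidity    <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
ph               <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
citric_acid      <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Pearson correlation coefficients (the default method of cor()).
r_fixed_ph        <- cor(fixed_acidity, ph)
r_citric_volatile <- cor(citric_acid, volatile_acidity)

# A negative sign means one variable tends to fall as the other rises.
c(fixed_acidity_vs_pH = r_fixed_ph,
  citric_vs_volatile  = r_citric_volatile)
```

Both coefficients are negative in this sample, which agrees with the direction seen in the scatter plots.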

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a negative correlation between citric acid and volatile acidity, and a positive relationship between fixed acidity and density.

What was the strongest relationship you found?

The relationship between alcohol quality and alcohol.

Multivariate Plots Section

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

With the previous plots I wanted to explore how the chemical variables interact with alcohol. Wines in the High Alcohol level have low chlorides, volatile acidity and density, while free sulfur dioxide increases in the High Alcohol level.

With the previous plots I wanted to explore the relationship between alcohol quality and alcohol. I found that high-quality wines have medium alcohol, lower density, high citric acid and low volatile acidity.

From the previous chart I found that density increases with fixed acidity, while pH decreases.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I observed that:

- Wines in the High Alcohol level have low chlorides, volatile acidity and density.
- Free sulfur dioxide increases in the High Alcohol level.
- High-quality wines have medium alcohol, lower density, high citric acid and low volatile acidity.

Were there any interesting or surprising interactions between features?

High quality wines have lower volatile acidity.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The chart shows that quality is roughly normally distributed around 5 to 6, and that about 80% of the red wines have a good quality.

Plot Two

Description Two

The distributions in the charts above are approximately normal, and free.sulfur.dioxide peaks at around 10 (the summary above gives a mean of 15.87 and a median of 14).

Plot Three

Description Three

The chart above shows a negative relationship between fixed acidity and pH.


Reflection

At the beginning of the project I tried to get to know the dataset better. I found 1,599 observations with 13 variables, and after removing the X column there were 12 variables. Through the analysis I noticed that quality is strongly related to the other variables, so I defined a new variable, “alcohol quality”, to see this clearly in the charts.

In the future, if the data includes more wines of the best quality, the alcohol quality plot will change and v.good will become the most common level instead of Medium. As a next step I will develop a statistical model for the red wine dataset.